Authorship Attribution using Variable Length Part-of-Speech Patterns

Authors

  • Yao Jean Marc Pokou
  • Philippe Fournier-Viger
  • Chadia Moghrabi
Abstract

Identifying the author of a book or document is an interesting research topic with numerous real-life applications. A number of algorithms have been proposed for the automatic authorship attribution of texts. However, finding distinct and quantifiable features that accurately identify, or narrow the range of, likely authors of a text remains an important challenge. In this paper, we propose a novel approach for authorship attribution that relies on the discovery of variable-length sequential patterns of parts of speech to build signatures representing each author’s writing style. An experimental evaluation was carried out using 30 books by 10 authors from Project Gutenberg, totaling 2,615,856 words. Results show that the proposed approach can accurately classify texts most of the time using a very small number of variable-length patterns. The proposed approach is also shown to perform better with variable-length patterns than with fixed-length patterns (bigrams or trigrams).
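To make the idea in the abstract concrete, the following is a minimal sketch of building an author signature from variable-length part-of-speech patterns. It is not the authors' implementation. It assumes the texts are already tagged (each sentence is a list of POS tags), it uses simple frequency counts of contiguous patterns as a stand-in for a real sequential pattern miner, and it compares signatures with Jaccard similarity. The function names and the `top_k`, `min_len`, and `max_len` parameters are hypothetical.

```python
from collections import Counter


def pos_patterns(tags, min_len=1, max_len=4):
    """Yield every contiguous tag pattern whose length is in min_len..max_len."""
    for n in range(min_len, max_len + 1):
        for i in range(len(tags) - n + 1):
            yield tuple(tags[i:i + n])


def signature(sentences, top_k=5, min_len=1, max_len=4):
    """Return the top_k most frequent patterns over a list of tagged sentences.

    Assumption: raw frequency is used here in place of a mined support value.
    """
    counts = Counter()
    for tags in sentences:
        counts.update(pos_patterns(tags, min_len, max_len))
    return {pattern for pattern, _ in counts.most_common(top_k)}


def attribute(unknown, signatures):
    """Return the author whose signature has the highest Jaccard overlap."""
    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    return max(signatures, key=lambda author: jaccard(unknown, signatures[author]))
```

In use, one would call `signature` once per known author on that author's tagged sentences, call it again on the anonymous text, and pass the results to `attribute`. Setting `min_len` and `max_len` to the same value reduces the sketch to fixed-length patterns such as bigrams or trigrams, which is the baseline the abstract compares against.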

Similar resources

Authorship Attribution Using Small Sets of Frequent Part-of-Speech Skip-grams

Computer-supported authorship attribution provides tools for extracting stylistic features that can help verify or identify the author of text documents. In many situations, finding the author of a document is very important, for example in detecting plagiarism to protect copyrights and in providing forensic support during criminal investigations. This paper thus explores a novel stylistic feature with ...

LitLin 18_4 423-447 fqh009 FIN

Large, real-world data sets have been investigated in the context of authorship attribution of real-world documents. N-gram measures can be used to accurately assign authorship for long documents such as novels. A number of 5 (authors) × 5 (movies) arrays of movie reviews were acquired from the Internet Movie Database. Both n-gram and naive Bayes classifiers were used to classify along both the au...

Enhancing Authorship Attribution By Utilizing Syntax Tree Profiles

The aim of modern authorship attribution approaches is to analyze known authors and to assign authorships to previously unseen and unlabeled text documents based on various features. In this paper we present a novel feature to enhance current attribution methods by analyzing the grammar of authors. To extract the feature, a syntax tree of each sentence of a document is calculated, which is then...

An Extremely Simple Authorship Attribution System

In this paper we present a very simple yet effective algorithm for authorship attribution. By this term we mean the act of telling whether a certain text was or was not written by a certain author. We shall not discuss the advantages or applications of this activity, but we propose a method for doing it in an automatic and instantaneous way, neither considering the language of the texts nor und...

Authorship Identification of E-mail as a Multi-Class Task - Notebook for PAN at CLEF 2011

In this paper, we describe a multi-class text categorization approach to authorship attribution and test it on sets of e-mail collections. The PAN 2011 competition data consists of e-mails of variable length, written by various candidate authors, with some represented by significantly longer or more e-mails than others. Rather than construct a classifier for each separate author to discriminate...

Publication date: 2016